The Best 347 Audio Classification Tools in 2025
Mms Lid 126
A language identification model fine-tuned from Facebook's Massively Multilingual Speech (MMS) project that classifies audio into 126 languages
Audio Classification
Transformers Supports Multiple Languages

facebook
2.1M
26
Wav2vec2 Base Finetuned Speech Commands V0.02
Apache-2.0
A voice command recognition model fine-tuned from facebook/wav2vec2-base on the speech_commands dataset, achieving 97.59% accuracy.
Audio Classification
Transformers

0xb1
1.2M
0
Whisper Medium Fleurs Lang Id
Apache-2.0
A spoken language identification model fine-tuned from OpenAI Whisper-medium, achieving 88.05% accuracy on the FLEURS dataset
Audio Classification
Transformers

sanchit-gandhi
590.30k
14
Wav2vec2 Large Robust 12 Ft Emotion Msp Dim
This model is fine-tuned from Wav2Vec2-Large-Robust for speech emotion recognition, predicting values in three dimensions: arousal, dominance, and valence.
Audio Classification
Transformers English

audeering
394.51k
109
Lang Id Voxlingua107 Ecapa
Apache-2.0
A speech language identification model based on the SpeechBrain framework and ECAPA-TDNN architecture, supporting recognition and speech embedding extraction for 107 languages.
Audio Classification Supports Multiple Languages
speechbrain
330.01k
112
Ast Finetuned Audioset 10 10 0.4593
Bsd-3-clause
An Audio Spectrogram Transformer (AST) fine-tuned on AudioSet. It converts audio into spectrograms and applies a vision transformer to them for audio classification.
Audio Classification
Transformers

MIT
308.88k
311
Whisper Small Ft Common Language Id
Apache-2.0
A general language identification model fine-tuned from openai/whisper-small, achieving 88.6% accuracy on the evaluation dataset
Audio Classification
Transformers

sanchit-gandhi
256.20k
2
Emotion Recognition Wav2vec2 IEMOCAP
Apache-2.0
A speech emotion recognition model that uses a fine-tuned wav2vec2 model trained on the IEMOCAP dataset
Audio Classification English
speechbrain
237.65k
131
Ast Finetuned Audioset 14 14 0.443
Bsd-3-clause
An Audio Spectrogram Transformer (AST) fine-tuned on the AudioSet dataset. It converts audio into spectrograms and processes them with a vision transformer architecture for audio classification.
Audio Classification
Transformers

MIT
194.20k
5
Wav2vec2 Large Xlsr 53 Gender Recognition Librispeech
Apache-2.0
A gender recognition model fine-tuned on the Librispeech-clean-100 dataset, achieving an F1 score of 0.9993 on the test set
Audio Classification
Transformers

alefiury
182.33k
42
Wav2vec English Speech Emotion Recognition
Apache-2.0
An English speech emotion recognition model fine-tuned from Wav2Vec 2.0, capable of recognizing 7 different emotions
Audio Classification
Transformers

r-f
139.06k
19
Hubert Large Speech Emotion Recognition Russian Dusha Finetuned
Apache-2.0
A Russian speech emotion recognition model fine-tuned from HuBERT on the DUSHA dataset, capable of identifying emotional states such as neutral, anger, positivity, and sadness.
Audio Classification
Transformers Other

xbgoose
111.13k
13
MERT V1 95M
MERT-v1-95M is a music understanding model trained with the MLM paradigm, with 95M parameters, supporting a 24 kHz audio sampling rate and a 75 Hz feature rate, suitable for various music information retrieval tasks.
Audio Classification
Transformers

m-a-p
83.72k
32
Audiobox Aesthetics
A unified automatic quality assessment model for speech, music, and sound
Audio Classification
Safetensors
facebook
56.27k
24
Mms Lid 256
This is a speech language identification model based on the Wav2Vec2 architecture, capable of recognizing 256 languages, and is part of Facebook's Massively Multilingual Speech (MMS) project.
Audio Classification
Transformers Supports Multiple Languages

facebook
48.38k
10
Wav2vec2 Large Robust 24 Ft Age Gender
This model takes raw audio signals as input and outputs age predictions and gender probabilities (child/female/male), along with the pooled state of the last transformer layer.
Audio Classification
Transformers

audeering
44.13k
33
Wav2vec2 Lg Xlsr En Speech Emotion Recognition
Apache-2.0
A speech emotion recognition model fine-tuned from Wav2Vec 2.0 that identifies 8 emotions in English speech, with an accuracy of 82.23% on the RAVDESS dataset
Audio Classification
Transformers

ehcalabres
39.83k
221
Wav2vec2 Base Superb Er
Apache-2.0
This is a speech emotion recognition model based on the Wav2Vec2 architecture, adapted from the S3PRL project, designed to identify emotional categories in speech.
Audio Classification
Transformers English

superb
28.14k
11
SER Odyssey Baseline WavLM Multi Attributes
MIT
A multi-attribute speech emotion recognition baseline model based on the WavLM architecture, predicting the arousal, dominance, and valence dimensions
Audio Classification
Transformers English

3loi
23.09k
7
Wav2vec2 Large Robust 6 Ft Age Gender
This model, fine-tuned from Wav2Vec2-Large-Robust, can predict the speaker's age and gender from raw audio.
Audio Classification
Transformers

audeering
19.29k
2
MERT V1 330M
MERT-v1-330M is a music understanding model trained with the MLM paradigm, with 330M parameters, supporting a 24 kHz audio sampling rate, and suitable for various music information retrieval tasks.
Audio Classification
Transformers

m-a-p
16.92k
65
Voice Gender Classifier
MIT
A pre-trained model based on the ECAPA-TDNN architecture for classifying gender from human speech
Audio Classification
Transformers

JaesungHuh
14.01k
16
Voice Safety Classifier
A voice content safety detection model based on the WavLM Base Plus architecture, used to identify toxic content in voice chats
Audio Classification
Transformers

Roblox
11.55k
37
Hubert Base Superb Ks
Apache-2.0
A keyword spotting model based on the HuBERT architecture, designed to classify speech segments into a predefined set of keywords.
Audio Classification
Transformers English

superb
11.29k
8
Ast Finetuned Speech Commands V2
Bsd-3-clause
An audio spectrogram transformer model fine-tuned on the Speech Commands v2 dataset for audio classification tasks, achieving 98.12% accuracy.
Audio Classification
Transformers

MIT
10.94k
15
Hubert Large Superb Er
Apache-2.0
An emotion recognition model based on the pre-trained HuBERT-Large model, predicting emotion categories in speech
Audio Classification
Transformers English

superb
10.24k
21
Voxlingua107 Epaca Tdnn
Apache-2.0
A spoken language identification model using the ECAPA-TDNN architecture, trained on the VoxLingua107 dataset and supporting recognition of 107 languages
Audio Classification Other
TalTechNLP
10.21k
28
AST VoxCelebSpoof Synthetic Voice Detection
MIT
A synthetic speech detection model fine-tuned from MIT/ast-finetuned-audioset-10-10-0.4593, reporting strong performance on the VoxCelebSpoof dataset
Audio Classification
Transformers English

MattyB95
9,518
4
Hubert Base Superb Er
Apache-2.0
An emotion recognition model based on the HuBERT-Base architecture, trained on the SUPERB emotion recognition task for speech emotion classification
Audio Classification
Transformers English

superb
7,887
20
Speech Emotion Recognition With Openai Whisper Large V3
Apache-2.0
A speech emotion recognition model built on Whisper that classifies audio into emotional categories such as happiness, sadness, and surprise.
Audio Classification
Transformers

firdhokk
7,750
33
Wav2vec2 Xlsr Persian Speech Emotion Recognition
Apache-2.0
This is a Persian speech emotion recognition model based on the Wav2Vec 2.0 architecture, capable of identifying six basic emotional states.
Audio Classification
Transformers Other

m3hrdadfi
5,114
8
Voice Safety Classifier V2
A multilingual voice toxicity detection model based on the WavLM architecture, supporting 8 languages and identifying 6 types of violations
Audio Classification
Transformers Supports Multiple Languages

Roblox
5,073
4
Wav2vec Vm Finetune
Apache-2.0
A voicemail detection model fine-tuned from facebook/wav2vec2-xls-r-300m, designed to distinguish voicemail greetings from live human responses.
Audio Classification
Transformers English

jakeBland
5,000
5
Wav2vecbert2 Filledpause
Apache-2.0
A model that classifies 20-millisecond audio frames to detect filled pauses (e.g., 'eee', 'errm')
Audio Classification Other
classla
4,290
0
Mms Lid 4017
This is a speech language identification model based on the Wav2Vec2 architecture, capable of recognizing 4017 languages, and is part of Facebook's Massively Multilingual Speech project.
Audio Classification
Transformers Supports Multiple Languages

facebook
3,721
8
Wav2vec2 Base Lang Id
Apache-2.0
A spoken language identification model fine-tuned from facebook/wav2vec2-base on the common_language dataset
Audio Classification
Transformers

anton-l
3,470
7
Music Genres Classification
Apache-2.0
A music genre classification model built on facebook/wav2vec2-base-960h, supporting recognition of 10 genres.
Audio Classification
Transformers

dima806
3,409
27
Ssast Small Patch Audioset 16 16
Bsd-3-clause
An audio classification model pre-trained on AudioSet and Librispeech that uses a vision transformer architecture to process audio spectrograms
Audio Classification
Transformers

Simon-Kotchou
2,408
1
Accent Id Commonaccent Ecapa
MIT
An ECAPA-TDNN model that classifies 16 accents in English speech, trained on the CommonAccent dataset and achieving a test accuracy of 87%.
Audio Classification English
Jzuluaga
2,291
15
Deepfake Audio Detection V2
Apache-2.0
A deepfake audio detection model fine-tuned on audio folder datasets, achieving 99.73% accuracy
Audio Classification
Transformers

MelodyMachine
2,289
14
Wav2vec2 Base Audioset
An audio representation learning model based on the Wav2Vec2 architecture, pre-trained on the complete AudioSet dataset
Audio Classification
Transformers

ALM
2,191
0
Musical Instrument Detection
Apache-2.0
A musical instrument detection model built on a foundational wav2vec 2.0 speech recognition model that was pre-trained on 960 hours of English speech data
Audio Classification
Transformers

dima806
2,109
7